
You can find all the files on GitHub.

1 Before we start

Before you start downloading data from the internet, you need to create a project in RStudio. Neither the name of the project nor its file path should include any of the following:
1. Greek letters
2. spaces (replace them with _)

For example your file path could be: C:/Environmental_Data_Analysis/Geographical_Data.

You also need to install and load all the packages that we are going to use today [1], if you have not done that already.

2 Online databases

2.2 Download vector data for Greece

We can download some vector data in the form of polygons. We can use the online database GADM to download country boundaries.

We will not delve into any details for now; we will do that in the next practical. We need these data for visualization purposes [2].

We can easily plot the data we have just downloaded.
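The code chunks are not reproduced here, but a minimal sketch of the download-and-plot step, assuming the raster package and the object name `Greece` (our choice), could look like this:

```r
library(raster)

## Fetch the GADM administrative boundaries for Greece
## (level 3 returns the finest subdivision, i.e. municipalities)
Greece <- getData(name = "GADM", country = "GRC", level = 3)

## A quick base-R plot of the polygons
plot(Greece)
```

Note that newer versions of raster deprecate getData() in favour of the geodata package, so the exact call may differ on your system.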

It is equally easy to subset those data.

## class       : SpatialPolygonsDataFrame 
## features    : 326 
## extent      : 19.37236, 29.6457, 34.80069, 41.74801  (xmin, xmax, ymin, ymax)
## crs         : +proj=longlat +datum=WGS84 +no_defs 
## variables   : 16
## names       : GID_0, NAME_0,   GID_1,                      NAME_1,                                                                                        NL_NAME_1,     GID_2,         NAME_2,                        NL_NAME_2,       GID_3,   NAME_3,          VARNAME_3,                                                     NL_NAME_3, TYPE_3,    ENGTYPE_3, CC_3, ... 
## min values  :   GRC, Greece, GRC.1_1,                      Aegean,                       Άθως, GRC.1.1_1,          Athos,      Άθως, GRC.1.1.1_1,   Abdera, Abelokipi-Menemeni,         Άθως,  Dímos, Municipality,   NA, ... 
## max values  :   GRC, Greece, GRC.8_1, Thessaly and Central Greece, Θεσσαλία και Στερεά Ελλάδα, GRC.8.2_1, West Macedonia,  Θεσσαλία, GRC.8.2.9_1, Zografou,           Zografos, Θεσσαλονίκης,  Dímos, Municipality,   NA, ...

We see that our object has a Coordinate Reference System (crs) and that it includes several columns (as indicated by the slot names).

Next, we are going to see which columns our object includes:
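Assuming the object is called `Greece` as above, the columns can be listed with names():

```r
## List the attribute columns of the SpatialPolygonsDataFrame
names(Greece)
```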

##  [1] "GID_0"     "NAME_0"    "GID_1"     "NAME_1"    "NL_NAME_1" "GID_2"    
##  [7] "NAME_2"    "NL_NAME_2" "GID_3"     "NAME_3"    "VARNAME_3" "NL_NAME_3"
## [13] "TYPE_3"    "ENGTYPE_3" "CC_3"      "HASC_3"

Finally, we are going to create a new spatial object by subsetting Greece. We do so by using one of the columns present in the Greece object.

Let’s plot Crete now.
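A sketch of the subsetting and plotting steps; the value "Crete" in the NAME_1 column is an assumption based on GADM's English region names:

```r
## Keep only the polygons whose first-level region name is Crete
Crete <- Greece[Greece$NAME_1 == "Crete", ]

## Plot the resulting object
plot(Crete)
```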

2.3 Introduction

As we mentioned earlier, the most basic category of vector data is points. In ecology, points most often represent sampling locations or field observations in general. These observations primarily consist of species presences at a site. Nowadays, it is fairly easy to obtain species occurrence data through freely available online databases, such as GBIF, BIEN and iNaturalist. Bear in mind that these databases refer to current/recent species occurrences; there are others, such as the Paleobiology and Neotoma databases [3], that contain information on fossil data. There are further databases, such as PREDICTS and COMPADRE, which deal with time series and population matrices, respectively. If you are interested in conservation-related data, then the IUCN is the best place to search. You can find several other databases here.

Today we are going to explore the GBIF and iNaturalist databases, but you are free to explore the others as well. The former contains more than 1.3 billion records, which can be easily accessed for ecological analyses, while the latter holds significantly fewer records and is essentially a citizen-science database. The ease of access and use is slightly counterbalanced by the fact that these databases contain several erroneous occurrences (e.g., misidentifications in the case of iNaturalist, coordinates falling into the sea in the case of GBIF; see Maldonado et al. (2015) for more).

Let’s first download some point data for Petromarula pinnata, which belongs to one of the six plant genera endemic to Greece.

Fig 1. The endemic species Petromarula pinnata from Crete - Photo from idr.


2.3.1 iNaturalist

First we will download data from the iNaturalist database. In order to merge these data with the data we will download from GBIF, we will have to add a column named countryCode and remove any rows without geographical information [4].
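A sketch of this step, using get_inat_obs() from the rinat package; the object name `inat` and the hard-coded country code "GR" are our assumptions:

```r
library(rinat)
library(dplyr)

## Download up to 1000 iNaturalist observations for the species
inat <- get_inat_obs(taxon_name = "Petromarula pinnata", maxresults = 1000)

## Add a countryCode column (so the columns match the GBIF download later)
## and drop any rows without coordinates
inat <- inat %>%
    mutate(countryCode = "GR") %>%  ## assumed value; adjust as needed
    filter(!is.na(latitude), !is.na(longitude))
```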

Next, we will have to perform some checks in order to remove any erroneous or suspicious coordinates.
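One common way to run such checks is clean_coordinates() from the CoordinateCleaner package; the column names below follow rinat's output and are assumptions:

```r
library(CoordinateCleaner)

## Flag records with suspicious coordinates: country capitals and centroids,
## identical/zero coordinates, GBIF headquarters, institutions, and points at sea
flags_inat <- clean_coordinates(x = inat,
                                lon = "longitude",
                                lat = "latitude",
                                species = "scientific_name",
                                tests = c("capitals", "centroids", "equal",
                                          "gbif", "institutions", "seas", "zeros"))
```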

You could run the above functions for a species that might interest you.

2.3.2 GBIF

Let’s now download data from GBIF for the same species. The first step is to check that the species is indeed referenced in GBIF. The function name_suggest() from the rgbif package searches for all species names that more or less fit the request. Only the species names from GBIF that exactly match the name of our species are kept.
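A sketch of the name-matching step (the object names are our choice):

```r
library(rgbif)

## Ask GBIF for name suggestions and keep only the exact match
sugg <- name_suggest(q = "Petromarula pinnata")$data
key  <- sugg$key[sugg$canonicalName == "Petromarula pinnata"][1]
```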

The previous step identifies the taxonomic reference number of our species. Using this reference is the safest way to request species occurrences in GBIF.

Now let’s download all the occurrences that are present in GBIF and have geographical information (we will download at most 1,000 occurrences).
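The request itself could be sketched as follows; occ_search() returns a list whose data element holds the actual records:

```r
## Download at most 1000 georeferenced occurrences for our taxon key
gbif <- occ_search(taxonKey = key, hasCoordinate = TRUE, limit = 1000)
gbif <- gbif$data
```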

## 
## Petromarula_pinnata A.DC.      Phyteuma_pinnatum L. 
##                       116                         1

It is not unusual for GBIF to contain (at least some) data from iNaturalist, so we will remove any occurrences that come from the latter database.
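A sketch of the filtering step, assuming the GBIF records live in a data frame called `gbif` with the standard institutionCode column:

```r
## How many records come from iNaturalist?
sum(gbif$institutionCode == "iNaturalist", na.rm = TRUE)

## Keep everything that is NOT an iNaturalist record
gbif <- gbif[is.na(gbif$institutionCode) | gbif$institutionCode != "iNaturalist", ]
```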

## [1] 26

Next, we will have to perform again some checks in order to remove any erroneous or suspicious coordinates.

You could run the above functions for a species that might interest you.

2.3.3 Join iNaturalist and GBIF data

We are now in a position to join both datasets and convert them into a spatial object.
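Assuming the two cleaned datasets are called `flags_inat` and `flags_gbif` and their coordinate columns have been harmonised, the join and conversion could be sketched as:

```r
library(dplyr)
library(sf)

## Stack the two datasets; .id records the source in a Dataset column
flags_both <- bind_rows(iNaturalist = flags_inat, GBIF = flags_gbif, .id = "Dataset")

## Convert to an sf object; GBIF/iNaturalist coordinates are in WGS84 (EPSG:4326)
flags_both <- st_as_sf(flags_both,
                       coords = c("longitude", "latitude"),
                       crs = 4326)
```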

Let’s visualise the result.

## ===========================================================================##
## First convert the Dataset column to a factor
## ===========================================================================##
flags_both$Dataset <- factor(flags_both$Dataset)
## ===========================================================================##


## ===========================================================================##
## Then add the centroids of each Cretan county
## ===========================================================================##
Crete_sf <- cbind(Crete_sf, st_coordinates(st_centroid(Crete_sf)))
## ===========================================================================##


## ===========================================================================##
## Then do the plotting
## ===========================================================================##
ggplot(Crete_sf) +
    geom_sf(fill = "antiquewhite1") +
    # geom_label(data = Crete_sf, aes(X, Y, label = NAME_3), size = 2, fontface = "bold") +
    geom_sf(data = flags_both, size = 2,
            aes(color = Dataset, fill = Dataset, shape = Dataset)) +
    annotation_scale(location = "bl", width_hint = 0.4) +
    annotation_north_arrow(location = "bl", which_north = "true",
                           pad_x = unit(0.75, "in"), pad_y = unit(0.5, "in"),
                           style = north_arrow_fancy_orienteering) +
    coord_sf(xlim = c(23.4, 26.4), ylim = c(34.7, 35.7), expand = FALSE) +
    xlab("Longitude") + ylab("Latitude") +
    ggtitle("Occurrences available online",
            subtitle = expression(italic("Petromarula pinnata"))) +
    theme(panel.grid.major = element_line(color = gray(0.5), linetype = "dashed",
                                          size = 0.5),
          panel.background = element_rect(fill = "aliceblue"))

2.3.4 Download GBIF and iNaturalist data for a specified polygon

We can also download species occurrence data from both databases by providing a polygon that we are interested in. Let’s see how that works for iNaturalist and GBIF. We shall do it first for the GBIF database.

## ============================================================================##
## First, create a vector of the species you are interested in
## ============================================================================##
splist <- c("Centaurea idaea", "Petromarula pinnata", "Dianthus juniperinus")
## ============================================================================##


## ============================================================================##
## Download data for them
## ============================================================================##
keys <- sapply(splist, function(x) name_suggest(x)$data$key[1], USE.NAMES = FALSE)
## ============================================================================##


## ============================================================================##
## Create a WKT string for the download
## ============================================================================##
ex <- raster::extent(Crete)
ex <- as(ex, "SpatialPolygons")
ex <- rgeos::writeWKT(ex)

## It seems that there is a problem with the number of digits, so we trimmed them
ex2 <- "POLYGON((23.47347 34.80068, 23.47347 35.69597, 26.35347 35.6959, 26.35347 34.80068, 23.47347 34.80068))"
## ============================================================================##


## ============================================================================##
## Download the data for these species from GBIF ----
## ============================================================================##
dat_ne <- occ_search(taxonKey = keys, hasCoordinate = T, limit = 1000, geometry = ex2)
## ============================================================================##


## ============================================================================##
## Convert it to a dataframe ----
## ============================================================================##
for (i in seq_along(dat_ne)) {
    dat_ne[[i]] <- dat_ne[[i]]$data
}

dat_ne <- lapply(dat_ne, "as.data.frame")

dat_ne <- bind_rows(dat_ne)
## ============================================================================##

Try running the above functions for some species that might be present in Epirus (e.g., Ophrys helenae).

And then, in a similar manner for the iNaturalist database.
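For iNaturalist, get_inat_obs() takes a bounding box rather than a WKT polygon, in the order c(swlat, swlng, nelat, nelng); the box below reproduces the extent of Crete printed underneath:

```r
library(rinat)

## Bounding box of Crete in the order expected by rinat
bb <- c(34.80069, 23.47347, 35.69597, 26.35347)

## Download observations of each species in splist within the box
inat_ne <- lapply(splist, function(sp) {
    get_inat_obs(taxon_name = sp, bounds = bb, maxresults = 1000)
})
```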

## class      : Extent 
## xmin       : 23.47347 
## xmax       : 26.35347 
## ymin       : 34.80069 
## ymax       : 35.69597

2.3.5 Download GBIF data for a specific country

Finally, we can download all the plant species that occur in a country (or a specified polygon as exemplified earlier) and are present in GBIF.

##============================================================================##
## Use the name_suggest function to get the gbif taxon key
##============================================================================##
tax_key <- name_suggest(q = "Magnoliopsida", rank = "Class")$data
##============================================================================##


##============================================================================##
## Sometimes groups have multiple taxon keys, so we will
## check how many records are available for them
##============================================================================##
lapply(tax_key$key, "occ_count")


##============================================================================##
## The first one is relevant
##============================================================================##
tax_key <- tax_key$key[1]
##============================================================================##


##============================================================================##
## Let's download some plant species for Crete
##============================================================================##
data_gr <- occ_search(taxonKey = tax_key, 
                      hasCoordinate = T, ## We only want data with coordinates
                      geometry = ex2, ## Here we define our study area polygon
                      limit = 3000) ## We only want 3000 occurrences
##============================================================================##


##============================================================================##
## And keep only the actual data
##============================================================================##
data_gr <- data_gr$data
##============================================================================##

2.3.6 Explore GBIF data for an entire country

Even though we can download many species occurrences from within R, there is a limit to how many of these occurrences we can actually download.

So, we have downloaded all the plant occurrences for Italy [5]. Let’s now load them and start exploring them.

This can be a lengthy process depending on your PC’s specifications. You can load both the raw and the cleaned data with the following commands: data_it <- readRDS('Data/RDS/Italian data.rds') and flags_gbif <- readRDS('Data/RDS/Italian data cleaned.rds')

We can now visualise these data. Let’s see how many occurrences there are for each taxonomic unit (taxonRank), as well as which institutions host these plant taxa (institutionCode).
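These tallies can be sketched with dplyr's count(); taxonRank and institutionCode are standard GBIF/Darwin Core columns:

```r
library(dplyr)

## Number of occurrences per taxonomic rank
count(data_it, taxonRank)

## Number of occurrences per hosting institution, largest first
count(data_it, institutionCode, sort = TRUE)
```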

taxonRank          n
CLASS              3
FAMILY          1839
FORM            1332
GENUS          14662
ORDER              8
PHYLUM             9
SPECIES       922917
SUBSPECIES     97262
VARIETY         7533

institutionCode        n
(blank)           482596
BIO-UNIPI         308257
iNaturalist       102451
naturgucker        33354
INFOFLORA          32041
TLMF               15110

As a next step, let’s see which order, family, genus and species have the most occurrences in Italy, according to GBIF.

We can also check the temporal dimension of these data (e.g., when were they collected?):
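One simple way to look at the temporal dimension, assuming the standard GBIF year column, is a histogram of collection years:

```r
library(ggplot2)

## Distribution of collection years (5-year bins)
ggplot(data_it, aes(x = year)) +
    geom_histogram(binwidth = 5) +
    labs(x = "Collection year", y = "Number of occurrences")
```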

Run the above for the cleaned dataset. Do you observe any differences?

Hint: You can use skim(data_it) and skim(flags_gbif) for a quick overview.

Which species of the genus Silene has the most occurrences in Italy?

2.3.7 Download GBIF data for several species

As mentioned earlier, there is a limit on how many occurrences we can download from within R. Let’s say that we have 150 tree species for which we want to download all of their occurrences from most Mediterranean countries. If you look at GBIF, you’ll see that these amount to more than 8 million occurrences; that’s quite a lot.

So what can we do? First of all, download and save in your working directory an Excel file with the names of those species.

Then, we are going to match those names with the names present in GBIF:
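A sketch of the matching step with name_backbone() from rgbif; the object `splist_trees` (the vector of the 150 tree species names read from the Excel file) is a hypothetical name:

```r
library(rgbif)
library(dplyr)

## Match each of our names against the GBIF backbone taxonomy
matched <- bind_rows(lapply(splist_trees, function(x) name_backbone(name = x)))

## The usageKey column holds the GBIF taxon keys we need
keys_trees <- matched$usageKey
```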

As a next step, we are going to create a query, asking the GBIF portal to download these data for us.
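The query can be built with occ_download() and its pred helpers; note that this requires GBIF credentials (e.g., GBIF_USER, GBIF_PWD and GBIF_EMAIL set in your .Renviron), and the country codes below are illustrative:

```r
library(rgbif)

## Ask the GBIF portal to prepare the download server-side
req <- occ_download(pred_in("taxonKey", keys_trees),
                    pred("hasCoordinate", TRUE),
                    pred_in("country", c("ES", "FR", "IT", "GR")))

## Block until GBIF has finished preparing the file
occ_download_wait(req)
```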

So now we can download the data we requested [6]. In the following few lines of code, we will see how we can import these massive data (4.7 GB) into R, but this is for demonstration purposes only.

Do not run this code on your PC now, since it may take some time to read the data [7]. You can read that file into R using the fread function from the data.table R package:
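A sketch of the import; the filename occurrence.txt follows the layout of a GBIF Darwin Core Archive and is an assumption:

```r
library(data.table)

## fread() reads the large tab-delimited occurrence file much faster than
## read.table(); quote = "" avoids problems with stray quote characters
occ_med <- fread("occurrence.txt", quote = "")
```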

References

Maldonado, C., Molina, C.I., Zizka, A., Persson, C., Taylor, C.M., Albán, J., Chilquillo, E., Rønsted, N., & Antonelli, A. (2015) Estimating species diversity and distribution in the era of big data: To what extent can we trust public databases? Global Ecology and Biogeography, 24, 973–984.


  [1] There are two ways to do that fast:
    1. library(pacman); p_load('package_name_1', 'package_name_2')
    2. install.packages(c('package_name_1', 'package_name_2'), dependencies = TRUE)
    Remember that if you use the first way, you need to install the pacman package first.

  [2] You are free to ask whatever you want, however.

  [3] There is a nice tutorial on Neotoma available here.

  [4] With no coordinates, to be exact.

  [5] You can download this file from the GitHub repository or on your own from GBIF.

  [6] It needs some time to do that, so we have to be a little bit patient.

  [7] You can do that after the webinar finishes, if you want.

 

Delivered to you by Dr. Kostas Kougioumoutzis

kkougiou@aua.gr